sparse dataset
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
- Research Report > New Finding (0.65)
- Research Report > Experimental Study (0.41)
Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
Appleton, Robert J, Barnes, Brian C, Strachan, Alejandro
Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting their accuracy and applicability. To improve predictions, techniques such as transfer learning and multi-task learning have been used. The performance of multi-task learning models depends on the strength of the underlying correlations between tasks and the completeness of the dataset. Standard multi-task models tend to underperform when trained on sparse datasets with weakly correlated properties. To address this gap, we fuse deep-learned embeddings generated by independent pre-trained single-task models, resulting in a multi-task model that inherits rich, property-specific representations. By re-using (rather than re-training) these embeddings, the resulting fused model outperforms standard multi-task models and can be extended with fewer trainable parameters. We demonstrate this technique on a widely used benchmark dataset of quantum chemistry data for small molecules as well as a newly compiled sparse dataset of experimental data collected from literature and our own quantum chemistry and thermochemical calculations.
- North America > United States > Maryland (0.04)
- North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
- North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
- Government > Military (0.68)
- Government > Regional Government (0.68)
A Comparative Study of Spline-Based Trajectory Reconstruction Methods Across Varying Automatic Vehicle Location Data Densities
Robbennolt, Jake, Munira, Sirajum, Boyles, Stephen D.
Automatic vehicle location (AVL) data offers insights into transit dynamics, but its effectiveness is often hampered by inconsistent update frequencies, necessitating trajectory reconstruction. This research evaluates 13 trajectory reconstruction methods, including several novel approaches, using high-resolution AVL data from Austin, Texas. We examine the interplay of four critical factors -- velocity, position, smoothing, and data density -- on reconstruction performance. A key contribution of this study is the evaluation of these methods across sparse and dense datasets, providing insights into the trade-off between accuracy and resource allocation. Our evaluation framework combines traditional mathematical error metrics for position and velocity with practical considerations, such as physical realism (e.g., aligning velocity and acceleration with stopped states, deceleration rates, and speed variability). In addition, we provide insight into the relative value of each method in calculating realistic metrics for infrastructure evaluations. Our findings indicate that velocity-aware methods consistently outperform position-only approaches. Interestingly, we discovered that smoothing-based methods can degrade overall performance in complex, congested urban environments, although enforcing monotonicity remains critical. The velocity constrained Hermite interpolation with monotonicity enforcement (VCHIP-ME) yields optimal results, offering a balance between high accuracy and computational efficiency. Its minimal overhead makes it suitable for both historical analysis and real-time applications, providing significant predictive power when combined with dense datasets. These findings offer practical guidance for researchers and practitioners implementing trajectory reconstruction systems and emphasize the importance of investing in higher-frequency AVL data collection for improved analysis.
- North America > United States > Texas > Travis County > Austin (0.48)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- Asia > Thailand (0.04)
- Transportation > Infrastructure & Services (1.00)
- Transportation > Ground (0.93)
- Transportation > Passenger (0.68)
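To make the idea of velocity-aware interpolation concrete, here is a minimal sketch of cubic Hermite interpolation between AVL samples that carry both position and velocity, followed by a simple monotonicity guard. It is not the authors' VCHIP-ME implementation: the function names, the sample layout `(time, position, velocity)`, and the choice to enforce monotonicity by clamping to the segment endpoints and taking a running maximum are all assumptions made for illustration.

```python
def hermite_position(t, t0, t1, p0, p1, v0, v1):
    """Cubic Hermite position at time t on the segment [t0, t1]."""
    h = t1 - t0
    s = (t - t0) / h                      # normalised time in [0, 1]
    h00 = 2 * s**3 - 3 * s**2 + 1         # weight of start position
    h10 = s**3 - 2 * s**2 + s             # weight of start velocity
    h01 = -2 * s**3 + 3 * s**2            # weight of end position
    h11 = s**3 - s**2                     # weight of end velocity
    return h00 * p0 + h10 * h * v0 + h01 * p1 + h11 * h * v1


def reconstruct(samples, query_times):
    """Interpolate positions at query_times (ascending, inside the sample range).

    samples: list of (time, position, velocity), sorted by time, with
    non-decreasing positions (assumption: the vehicle never reverses).
    """
    out = []
    seg = 0
    furthest = samples[0][1]
    for t in query_times:
        # Advance to the segment that contains t.
        while seg < len(samples) - 2 and t > samples[seg + 1][0]:
            seg += 1
        (t0, p0, v0), (t1, p1, v1) = samples[seg], samples[seg + 1]
        p = hermite_position(t, t0, t1, p0, p1, v0, v1)
        # Hypothetical monotonicity guard: stay within the segment's
        # endpoints and never report a position behind an earlier one.
        p = min(max(p, p0), p1)
        furthest = max(furthest, p)
        out.append(furthest)
    return out


# Vehicle accelerates, then is observed stopped at position 50.
samples = [(0, 0, 0), (10, 50, 10), (20, 50, 0)]
trajectory = reconstruct(samples, [0, 5, 10, 15, 20])
```

In the second segment the unconstrained Hermite curve would overshoot the stop position (the reported velocity at `t = 10` is still positive), which is the kind of physically unrealistic artifact a monotonicity step is meant to suppress.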
Design of an basis-projected layer for sparse datasets in deep learning training using gc-ms spectra as a case study
Chang, Yu Tang, Chen, Shih Fang
Deep learning (DL) models encompass millions or even billions of parameters and learn complex patterns from big data. However, not all data are initially stored in a format suitable for effectively training a DL model, e.g., gas chromatography-mass spectrometry (GC-MS) spectra and DNA sequences. These datasets commonly contain many zero values, and the sparse data format causes difficulties in optimizing DL models. A DL module called the basis-projected layer (BPL) was proposed to mitigate the issue by transforming the sparse data into a dense representation. The transformed data is expected to facilitate gradient calculation and fine-tuning during DL training. The dataset, an example of a sparse dataset, contained 362 specialty coffee odorant spectra detected by GC-MS. The BPL was placed at the beginning of the DL model. The tunable parameters in the layer were learnable projection axes that formed the bases of a new representation space. The layer rotated these bases when its parameters were updated. When the number of bases was the same as the original dimension, the F1 score increased by 8.56%. Furthermore, when the number was set to 768 (the original dimension was 490), the F1 score increased by 11.49%. The layer not only maintained model performance but also constructed a better representation space for analyzing sparse datasets.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
Scaling Up Differentially Private LASSO Regularized Logistic Regression via Faster Frank-Wolfe Iterations
Raff, Edward, Khanna, Amol, Lu, Fred
To the best of our knowledge, there are no methods today for training differentially private regression models on sparse input data. To remedy this, we adapt the Frank-Wolfe algorithm for $L_1$ penalized linear regression to be aware of sparse inputs and to use them effectively. In doing so, we reduce the training time of the algorithm from $\mathcal{O}( T D S + T N S)$ to $\mathcal{O}(N S + T \sqrt{D} \log{D} + T S^2)$, where $T$ is the number of iterations and $S$ is the sparsity rate of a dataset with $N$ rows and $D$ features. Our results demonstrate that this procedure can reduce runtime by a factor of up to $2,200\times$, depending on the value of the privacy parameter $\epsilon$ and the sparsity of the dataset.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
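For readers unfamiliar with Frank-Wolfe, the following is a minimal, non-private sketch of the baseline idea on sparse rows: logistic regression constrained to an $L_1$ ball, where each step moves toward the single vertex of the ball that best aligns with the negative gradient. It includes neither the differential privacy mechanism nor the paper's faster iteration scheme (it still scans all features per iteration). The constraint form, the row layout `{feature_index: value}`, the function names, and the step size `2 / (t + 2)` are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def frank_wolfe_l1_logistic(rows, labels, n_features, radius=1.0, n_iters=50):
    """Minimise mean logistic loss subject to ||w||_1 <= radius.

    rows: list of dicts {feature_index: value} holding only non-zeros.
    labels: list of 0/1 targets.
    """
    w = [0.0] * n_features
    n = len(rows)
    for t in range(n_iters):
        # Gradient of the mean logistic loss, touching only stored non-zeros.
        grad = [0.0] * n_features
        for x, y in zip(rows, labels):
            margin = sum(v * w[j] for j, v in x.items())
            residual = sigmoid(margin) - y
            for j, v in x.items():
                grad[j] += residual * v / n
        # Linear minimisation over the L1 ball: the best vertex lies on the
        # coordinate with the largest absolute gradient.
        j_star = max(range(n_features), key=lambda j: abs(grad[j]))
        if grad[j_star] == 0.0:
            break
        step = 2.0 / (t + 2.0)
        # Convex combination of the current iterate and the chosen vertex.
        w = [(1.0 - step) * wj for wj in w]
        w[j_star] -= step * radius * math.copysign(1.0, grad[j_star])
    return w


rows = [{0: 1.0}, {0: 1.0}, {1: 1.0}, {1: 1.0}]
labels = [1, 1, 0, 0]
weights = frank_wolfe_l1_logistic(rows, labels, n_features=3)
```

Because every update is a convex combination of points inside the ball, the iterate never leaves the $L_1$ constraint set, and each iteration changes the direction of at most one coordinate, which is what makes the method attractive for sparse models.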
Dealing with Sparse Datasets in Machine Learning
This article was published as a part of the Data Science Blogathon. Missing data in machine learning is data that contains null values, whereas sparse data is a dataset in which a large share of the feature values are zero or null rather than informative values. The two are not the same thing. Sparse datasets with many zero values can cause problems such as over-fitting in machine learning models, among others. That is why dealing with sparse data is one of the most demanding tasks in machine learning.
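The distinction can be illustrated with a small sketch in plain Python. It assumes, for the sake of the example, that `None` marks a missing entry and that `0` is a recorded value; the helper names are hypothetical and not taken from the article.

```python
def sparsity(rows):
    """Fraction of entries that are exactly zero (None is not counted as zero)."""
    total = sum(len(r) for r in rows)
    zeros = sum(1 for r in rows for v in r if v == 0)
    return zeros / total if total else 0.0


def missing_rate(rows):
    """Fraction of entries that are missing (None)."""
    total = sum(len(r) for r in rows)
    missing = sum(1 for r in rows for v in r if v is None)
    return missing / total if total else 0.0


def to_coordinate_dict(rows):
    """Keep only the informative entries as {(row, col): value}."""
    return {
        (i, j): v
        for i, r in enumerate(rows)
        for j, v in enumerate(r)
        if v is not None and v != 0
    }


rows = [
    [0, 0, 3, 0],
    [0, None, 0, 0],
    [5, 0, 0, 0],
]

print(sparsity(rows))            # share of zero entries
print(missing_rate(rows))        # share of missing entries
print(to_coordinate_dict(rows))  # compact storage of the non-zeros
```

Storing only the non-zero coordinates is the basic idea behind sparse matrix formats, which is why a dataset can be very sparse and still contain no missing values at all.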
Yet Another Library for Deep Learning You Should Know About
It has many algorithms, supports sparse datasets, is fast and has many utility functions, like cross-validation, grid search, etc. When it comes to advanced modeling, scikit-learn often falls short. If you need Boosting, Neural Networks or t-SNE, it's better to avoid scikit-learn. While MLPClassifier and MLPRegressor have a rich set of arguments, there's no option to customize layers of a Neural Network (beyond setting the number of hidden units for each layer) and there's no GPU support. While there are already superior libraries available like PyTorch or Tensorflow, scikit-neuralnetwork may be a good choice for those coming from a scikit-learn ecosystem.
Deep Learning with scikit-learn
It has a good set of algorithms, supports sparse datasets, is fast and has many utility functions, like cross-validation, grid search, etc. When it comes to advanced modeling, scikit-learn often falls short. If you need Boosting, Neural Networks or t-SNE, it is better to avoid scikit-learn. There is MLPClassifier for classification and MLPRegressor for regression. While both have a rich set of arguments, there isn't an option to customize layers of a Neural Network (beyond setting the number of hidden units for each layer).
Chromatic Learning for Sparse Datasets
Feinberg, Vladimir, Bailis, Peter
Learning over sparse, high-dimensional data frequently necessitates the use of specialized methods such as the hashing trick. In this work, we design a highly scalable alternative approach that leverages the low degree of feature co-occurrences present in many practical settings. This approach, which we call Chromatic Learning (CL), obtains a low-dimensional dense feature representation by performing graph coloring over the co-occurrence graph of features---an approach previously used as a runtime performance optimization for GBDT training. This color-based dense representation can be combined with additional dense categorical encoding approaches, e.g., submodular feature compression, to further reduce dimensionality. CL exhibits linear parallelizability and consumes memory linear in the size of the co-occurrence graph. By leveraging the structural properties of the co-occurrence graph, CL can compress sparse datasets, such as KDD Cup 2012, that contain over 50M features down to 1024, using an order of magnitude fewer features than frequency-based truncation and the hashing trick while maintaining the same test error for linear models. This compression further enables the use of deep networks in this wide, sparse setting, where CL similarly has favorable performance compared to existing baselines for budgeted input dimension.
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
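The core idea in the Chromatic Learning abstract, coloring the feature co-occurrence graph so that features sharing a color never appear in the same row, can be sketched as follows. This is a toy illustration with a simple greedy coloring, not the authors' implementation; the function names and the row layout (a set of active feature ids per row) are assumptions.

```python
from itertools import combinations


def cooccurrence_graph(rows):
    """Build an adjacency map: two features are linked if they share a row."""
    adj = {}
    for active in rows:
        for f in active:
            adj.setdefault(f, set())
        for a, b in combinations(sorted(active), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def greedy_coloring(adj):
    """Give each feature the smallest color unused by its neighbours."""
    color = {}
    # Visit high-degree features first (a common greedy heuristic).
    for f in sorted(adj, key=lambda f: (-len(adj[f]), f)):
        used = {color[n] for n in adj[f] if n in color}
        c = 0
        while c in used:
            c += 1
        color[f] = c
    return color


def encode(rows, color):
    """Dense encoding: one categorical column per color.

    Each column holds the id of the active feature with that color,
    or None if no feature of that color is active in the row.
    """
    n_colors = max(color.values()) + 1
    encoded = []
    for active in rows:
        dense = [None] * n_colors
        for f in active:
            dense[color[f]] = f
        encoded.append(dense)
    return encoded


rows = [{0, 1}, {2, 3}, {0, 3}, {4}]
adj = cooccurrence_graph(rows)
color = greedy_coloring(adj)
dense_rows = encode(rows, color)
```

Since co-occurring features always receive different colors, each color column contains at most one active feature per row, so the dense encoding loses no information while using as many columns as there are colors rather than one per feature.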